Build Brown Corpus


In [1]:
from gensim import corpora, models, similarities
from nltk.corpus import brown
import nltk
import re

In [2]:
dictb = corpora.Dictionary([[word.lower() for word in sent]
                            for sent in brown.sents()])
print dictb


Dictionary(49815 unique tokens: [u'fawn', u'belligerence', u'mid-week', u'1,800', u'deferment']...)

In [3]:
stopid = [dictb.id2token[w] for w in nltk.corpus.stopwords.words('english')
          if w in dictb.keys()]
onceid = [id for id, freq in dictb.dfs.iteritems() if freq == 1]
numberid = [id for id, word in dictb.items() if any(char.isdigit() for char in word)]
print len(stopid + onceid + numberid)


24477

In [4]:
dictb.filter_tokens(stopid + onceid + numberid)
dictb.compactify()
print dictb


Dictionary(26840 unique tokens: [u'woods', u'francesco', u'francesca', u'comically', u'over/under']...)

In [5]:
corpus = [dictb.doc2bow([w.lower() for w in sent]) for sent in brown.sents()]

In [6]:
corpora.MmCorpus.serialize('brown.mm', corpus)
dictb.save('brown.dict')

Transformation

一般有兩個步驟

  1. 第一步,利用traning corpus產生一個model
  2. 第二步,使用model轉換corpus或未來的test data

In [7]:
tfidf = models.TfidfModel(corpus)  # 第一步,產生model

In [8]:
tfidf[corpus[0]]  # 第二步,轉換vector


Out[8]:
[(251, 0.0054366210864968635),
 (2769, 0.2573296526826243),
 (4196, 0.23146850516068368),
 (5258, 0.24089333580097672),
 (5893, 0.24971487236866738),
 (7094, 0.07203518732269294),
 (7415, 0.10101429608144462),
 (7755, 0.20931521918529866),
 (8505, 0.3214467806899905),
 (8585, 0.018273466302608996),
 (8666, 0.20489153870964272),
 (10023, 0.2162871901775302),
 (11866, 0.178026045458523),
 (12102, 0.1208869196608621),
 (12327, 0.13821665373548117),
 (12505, 0.03230996681149036),
 (13790, 0.24503612567300428),
 (13905, 0.2543720975648415),
 (18918, 0.16741563612639151),
 (19791, 0.07173175044792364),
 (21390, 0.2346191803192959),
 (22184, 0.29868491027427885),
 (22278, 0.06632764794493408),
 (22921, 0.34654558591074425),
 (23566, 0.12259937792129812)]

注意: 呼叫model[corpus]是即時運算,結果不會保存,所以每次所需的時間都相同。如果轉換很花時間,你需要將轉換的結果存到硬碟再用streaming讀取。

不同轉換之間可以串接,例如先作TF-IDF再作LSI。


In [11]:
tfidf = models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]

In [12]:
lsi = models.LsiModel(corpus_tfidf, num_topics=200, id2word=dictb)
corpus_lsi = lsi[corpus_tfidf]

In [13]:
lsi.print_topics(5, 8)  # 列出前5名的topics


Out[13]:
[u'0.346*"?" + 0.240*"," + 0.221*"\'\'" + 0.211*"``" + 0.197*"the" + 0.182*"to" + 0.179*"of" + 0.170*"a"',
 u'-0.839*"?" + 0.166*";" + -0.146*"\'\'" + -0.127*"``" + -0.124*"you" + 0.123*"the" + 0.122*"," + 0.118*"of"',
 u'-0.922*";" + 0.182*"!" + -0.167*"?" + 0.143*"." + 0.132*"\'\'" + 0.119*"``" + 0.086*"i" + 0.058*"said"',
 u'-0.709*"!" + -0.354*"\'\'" + -0.335*"``" + -0.271*";" + 0.212*"?" + -0.146*"i" + -0.112*"said" + 0.087*"of"',
 u'-0.977*"." + -0.141*";" + 0.041*"," + 0.040*"of" + 0.039*"the" + 0.038*"!" + -0.037*"``" + 0.034*"to"']

Save and Load


In [14]:
tfidf.save('model.tfidf')
tfidf = models.TfidfModel.load('model.tfidf')

In [15]:
lsi.save('model.lsi')
lsi = models.LsiModel.load('model.lsi')

Other Transformations

Random Projections, RP

是一種接近TF-IDF distance的方法,加上一點隨機成份,無論記憶體或CPU都很有效率。
model = rpmodel.RpModel(tfidf_corpus, num_topics=500)

Latent Dirichlet Allocation, LDA

另一種bag-of-words的計算方法,是LSA(或稱multinomial PCA)的機率版本。
model = ldamodel.LdaModel(bow_corpus, id2word=dictionary, num_topics=100)

Hierarchical Dirichlet Process, HDP

non-parametric bayesian method,不需要指定topic數量。
model = hdpmodel.HdpModel(bow_corpus, id2word=dictionary)